A major problem in cluster analysis is determining the number of
subpopulations  from  the  sample  data.   In  this  study,  it  is  assumed  that 
the  subpopulations  correspond  to  modes  of  the  population  density  function. 
The  kth  nearest  neighbor  clustering  procedure,  which  is  known  to  be  set- 
consistent  for  high-density  clusters,  is  then  shown  to  be  useful  in 
providing:   (1)  a  diagnostic  plot  which  will  indicate  the  number  of  sub- 
populations  present,  and  (2)  a  bootstrap  procedure  for  testing  the 
existence  of  two  or  more  subpopulations.   The  performance  of  these  pro- 
cedures will  be  illustrated  by  real  examples. 


KEYWORDS:   Modes;  kth  nearest  neighbor  clustering;  diagnostic  plot;
hypothesis  testing;  bootstrap. 


1.   INTRODUCTION

1.1  Background

A  recent  study  by  Blashfield  and  Aldenderfer  (1978)  shows  that  num- 
erous clustering  methods  have  been  developed  in  the  past  two  decades.   A 
review  of  many  of  these  techniques  can  be  found  in  Cormack  (1971) , 
Anderberg  (1973),  Sneath  and  Sokal  (1973),  Everitt  (1974),  Hartigan  (1975), 
and  Spath  (1980).   The  validity  of  the  sample  clusters  obtained  by  these 
methods  is  always  questionable,  however,  due  to  the  lack  of  development 
in  the  probabilistic  and  statistical  aspects  of  clustering  methodology. 
Consequently,  the  existing  clustering  procedures  are  often  regarded  as 
heuristics  generating  artificial  clusters  from  a  given  set  of  sample  data, 
and there is a need for clustering procedures that are useful for drawing
statistical  inferences  about  the  underlying  population  from  a  sample. 
In  this  paper,  we  consider  the  important  problem  of  assessing  and  testing 
the  number  of  "clusters"  or  "subpopulations"  present  in  the  population. 

1.2  Statistical  Inference  Under  the  Density-contour  Clustering  Model 

In  this  study,  we  assume  that  the  clustering  data  consist  of  a  sample 
from  a  distribution   F  with  density  function   f,   on  which  population 
clusters  are  defined  by  a  clustering  model.   The  clustering  model  that 
will  be  used  here  is  the  "density-contour"  model  given  in  Hartigan  (1975) 
and  Wong  and  Lane  (1981).   Using  this  model,  the  true  population  clusters 
can be defined on f as follows: for all $f^* \ge 0$, a density-contour
cluster at level $f^*$ in the population is defined as a maximal connected
set of the form $\{x : f(x) \ge f^*\}$. The family $T$ of such clusters
forms a tree in the sense that $A \in T$, $B \in T$ implies either
$A \supset B$, $B \supset A$, or $A \cap B = \emptyset$, the empty set.

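A hierarchical clustering procedure, which produces a sample clustering
tree $T_N$ on the observations $X_1, \ldots, X_N$, may then be evaluated by
examining whether $T_N$ converges to $T$ with probability one when $N$
approaches infinity. A clustering procedure (or $T_N$) is said
to be strongly set-consistent for density-contour clusters (or $T$) if
for any $A, B \in T$ with $A \cap B = \emptyset$,

$$P\{A_N \cap B_N = \emptyset \text{ as } N \to \infty\} = 1,$$

where $A_N$ and $B_N$ denote the smallest clusters in $T_N$ containing all
the sample points in $A$ and $B$, respectively. Since $A \supset B$
implies $A_N \supset B_N$, this limit result means that the tree relationship in $T_N$
converges strongly to the tree relationship in $T$. This consistent
cluster-estimation problem under the density-contour clustering model has been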
addressed  by  Hartigan    (1981)    and  Wong  and  Lane    (1981).      Hartigan    (1981) 
has shown that most of the best known hierarchical clustering methods are
not   set-consistent,   while  Wong  and  Lane    (1981)    developed   a  kth  nearest 
neighbor   clustering   procedure  which    is    strongly    set-consistent    for   density- 
contour   clusters. 

The problem of hypothesis testing under the density-contour clustering
model has not received much attention. (See, however, Hartigan (1977)
for a discussion of the DIP statistic for testing bimodality.) One important
feature of the density function f under the density-contour clustering
model is the set of modes of f, each of which is the limit of a decreasing
sequence  of  density-contour  clusters.   In  this  paper,  it  is  assumed  that 
any  subpopulation  in  the  population  corresponds  to  a  mode  in  the  density 
function   f.   Our  aim  is  to  develop  procedures  that  are  useful  for 
assessing  and  testing  the  number  of  modes  present  in   f.   It  will  be  shown 
that  the  kth  nearest  neighbor  clustering  procedure  given  in  Wong  and  Lane 
(1981)  is  useful  in  providing  (i)  a  diagnostic  plot  for  assessing  the 
number  of  modes,  and  (ii)  a  statistic  for  testing  multimodality. 

Using  the  above  formulation,  the  statistical  problem  being  considered 
is  that  of  determining  the  number  of  modes  in  the  underlying  density   f. 
A   brief  review  of  the  literature  on  testing  for  modes  will  be  given  in 
Section  2.   In  Section  3,  the  kth  nearest  neighbor  clustering  procedure 
will  first  be  reviewed,  and  then  it  will  be  shown  how  a  diagnostic  plot 
based  on  this  procedure  can  be  constructed  to  assess  multimodality.   A 
test  statistic  for  examining  multimodality  is  proposed  in  Section  4,  and 
it  will  also  be  shown  how  the  significance  level  of  a  sample  test  statistic 
can  be  estimated  by  using  the  bootstrap  procedure.   Generated  data  will  be 
used to illustrate the performance of these procedures. Finally, in Section 5,
the practical utility of the proposed procedures is demonstrated on
several well-known data sets.


2.   LITERATURE  REVIEW 

Several  authors  have  studied  the  problem  of  testing  for  clusters. 
In  Engelman  and  Hartigan  (1969)  and  Hartigan  (1978),  a  likelihood  ratio 
approach  to  the  problem  of  testing  whether  the  data  indicate  the  presence 
of  two  different  univariate  normal  populations  or  only  one  is  proposed. 
Multivariate generalizations of Engelman and Hartigan's work can be found
in  Lee  (1979),  in  which  the  union-intersection  principle  of  test  construc- 
tion is  used  to  develop  a  multivariate  test  for  clusters.   One  major 
drawback  of  these  testing  procedures  is  that  they  are  based  on  a  restric- 
tive parametric  clustering  model,  where  clusters  are  assumed  to  be  com- 
ponents of  a  normal  mixture. 

In  this  paper,  the  nonparametric  density-contour  clustering  model 
is  used,  where  clusters  are  defined  by  the  density  contours  of  the  under- 
lying density  function.   Our  aim  is  to  develop  procedures  that  are  useful 
for  assessing  and  testing  multimodality.   In  the  clustering  literature, 
two  different  statistics  have  been  proposed  for  testing  bimodality  in  one 
dimension. Kruskal's test given in Giacomelli et al. (1971) is based on
the  differences  between  order  statistics,  while  Hartigan's  (1977)  DIP 
statistic  looks  for  a  large  interval  between  two  sets  of  small  intervals 
in  the  minimum  spanning  tree  obtained  for  the  sample  observations.   The 
major  problem  encountered  in  using  these  test  statistics  is  the  selection 
of  an  appropriate  distribution  function  for  the  null  hypothesis.   The  uni- 
form and  normal  distributions  have  been  used  by  both  of  the  above  authors 
to  compute  the  sampling  distribution  of  the  proposed  statistics,  but  the 
appropriateness of using these null distributions in cluster analysis
remains questionable.

Silverman  (1981)  proposed  a  statistic  for  testing  the  multimodality 
of  an  underlying  density  function   f  which  is  based  on  the  kernel 
density  estimate.   More  importantly,  he  proposed  an  intuitively  appealing 
bootstrap  procedure  for  estimating  the  significance  level  of  a  sample 
value  of  his  statistic,  without  having  to  use  the  uniform  or  normal  as  the 
null  distribution.   In  this  paper,  a  statistic  is  proposed  for  testing 
multimodality,  which  is  based  on  the  kth  nearest  neighbor  clustering  pro- 
cedure given  in  Wong  and  Lane  (1981) ,  and  it  will  be  shown  how  a  modified 
bootstrap  procedure  can  be  used  to  estimate  the  significance  level  of  a 
sample  value  of  this  statistic. 


3.   A  DIAGNOSTIC  PLOT  FOR  THE  NUMBER  OF  MODES 

3.1  The kth Nearest Neighbor Clustering Procedure

In  this  section,  it  will  be  shown  that  the  kth  nearest  neighbor 
clustering  procedure  given  in  Wong  and  Lane  (1981)  can  be  used  to  provide 
a  diagnostic  plot  for  assessing  the  number  of  modes  in  a  density   f  using 
some sample data $X_1, X_2, \ldots, X_N$ from f. This clustering procedure can
be  described  as  follows: 

Step 1:  For $i = 1, 2, \ldots, N$, compute $d_k(X_i)$, the kth nearest
neighbor distance for observation $X_i$.

Step 2:  Compute the distance matrix D as follows:

$$
D(X_i, X_j) =
\begin{cases}
0, & \text{if } X_i = X_j; \\
\tfrac{1}{2}\,[d_k(X_i) + d_k(X_j)], & \text{if } d^*(X_i, X_j) \le d_k(X_i) \text{ or } d^*(X_i, X_j) \le d_k(X_j); \\
\infty, & \text{otherwise,}
\end{cases}
$$

where $d^*$ is the Euclidean metric.

Step 3:  Apply the single linkage clustering algorithm to the computed
distance matrix D to obtain the sample tree of high-density
clusters.
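
Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: the function names are hypothetical, the kth nearest neighbor of a point is assumed to exclude the point itself, and single linkage is assumed to be carried out by merging the finite entries of D in increasing order (a Kruskal-style pass), so that pairs at infinite distance are never joined directly.

```python
import math

def kth_nn_distances(points, k):
    """Step 1: d_k(X_i), the Euclidean distance from X_i to its kth nearest other point."""
    out = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(dists[k - 1])
    return out

def knn_distance_matrix(points, k):
    """Step 2: D = 0 for coincident points, the mean of the two kth-NN
    distances for neighbouring pairs, and infinity otherwise."""
    n = len(points)
    dk = kth_nn_distances(points, k)
    D = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = 0.0
        for j in range(i + 1, n):
            d_star = math.dist(points[i], points[j])
            if d_star == 0.0:
                D[i][j] = D[j][i] = 0.0
            elif d_star <= dk[i] or d_star <= dk[j]:
                D[i][j] = D[j][i] = 0.5 * (dk[i] + dk[j])
    return D

def single_linkage(D):
    """Step 3: single linkage on D; returns the merges as (height, i, j) tuples."""
    n = len(D)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    edges = sorted((D[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n)
                   if D[i][j] < math.inf)
    merges = []
    for height, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append((height, i, j))
    return merges

# Univariate observations are passed as one-element tuples.
pts = [(0,), (1,), (2,), (10,), (11,), (12,)]
tree = single_linkage(knn_distance_matrix(pts, k=1))
```

With fewer than N - 1 merges returned, the sample splits into groups that are never joined at any finite level, which is how well separated high-density clusters show up in this sketch.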


3.2  A  Diagnostic  Plot  for  the  Number  of  Modes 

In Wong and Lane (1981), it is pointed out that for the kth nearest
neighbor clustering procedure to be strongly set-consistent, k has to be
chosen in such a way that $k(N)/N \to 0$ and $k(N)/\log N \to \infty$ as $N \to \infty$.
However,  the  problem  of  choosing  k  in  practice  has  not  been  dealt  with, 
although  it  has  been  suggested  that  a  range  of  values  of  k   should  be 
tried.   Here,  it  is  proposed  that  the  number  of  modes  identified  in  the 
sample  hierarchical  clustering  when  different  values  of  k   are  used  should 
be  plotted  against  k  because  this  plot  is  useful  in  suggesting  the  num- 
ber of  modes  in  the  population. 

It  is  not  difficult  to  see  that  the  value  of  k   controls  the  amount 
by  which  the  data  are  smoothed  to  give  the  density  estimate  on  which  the 
clustering  procedure  is  based.   When   k   increases  from  1  to  N,  the  density 
estimate  becomes  smoother  or  less  bumpy;  that  is,  the  number  of  identified 
modes  is  a  non-increasing  function  of  k.   (This  result  is  proved  in 
Silverman  (1981)  for  the  kernel  density  estimate.)   Hence,  the  plot  of 
"number  of  estimated  modes"  against   k  will  show  a  non-increasing  step 
function;  and  it  is  expected  that  when  the  number  of  estimated  modes 
reaches  the  true  number  of  modes,  it  will  be  stable  over  a  range  of  values 
of  k.   The  results  of  a  Monte  Carlo  study  performed  to  examine  the 
effectiveness  of  this  diagnostic  plot  will  be  reported  next. 
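
The construction of the plot can be sketched for univariate data as follows. The text does not spell out how modes are read off the sample tree, so the counting rule here is an assumption made for illustration only: observations are added in order of increasing $d_k$ (that is, decreasing estimated density), two observations are treated as neighbors under the same condition as in Step 2, and a mode is counted each time a new connected group appears that does not attach to an already present, denser group. The function names are hypothetical.

```python
def kth_nn_distances(xs, k):
    """d_k(x_i) for univariate data, excluding the point itself (assumption)."""
    return [sorted(abs(x - y) for j, y in enumerate(xs) if j != i)[k - 1]
            for i, x in enumerate(xs)]

def count_modes(xs, k):
    """Number of sample modes under the assumed counting rule described above."""
    n = len(xs)
    dk = kth_nn_distances(xs, k)
    neighbours = [[j for j in range(n) if j != i and
                   (abs(xs[i] - xs[j]) <= dk[i] or abs(xs[i] - xs[j]) <= dk[j])]
                  for i in range(n)]
    parent = list(range(n))
    birth = dk[:]            # smallest d_k in each group, kept at the group's root
    added = [False] * n

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    order = sorted(range(n), key=lambda i: dk[i])
    modes, pos = 0, 0
    while pos < n:
        level = dk[order[pos]]
        group = []
        while pos < n and dk[order[pos]] == level:   # handle ties together
            group.append(order[pos])
            pos += 1
        for i in group:
            added[i] = True
        for i in group:
            for j in neighbours[i]:
                if added[j]:
                    ri, rj = find(i), find(j)
                    if ri != rj:
                        # the surviving root keeps the earlier (smaller) birth level
                        if birth[ri] < birth[rj]:
                            parent[rj] = ri
                        else:
                            parent[ri] = rj
        # groups whose densest member entered at this level are new modes
        modes += len({find(i) for i in group if birth[find(i)] == level})
    return modes

def diagnostic_curve(xs):
    """(k, number of modes) for k = 1, ..., N-1: the points of the diagnostic plot."""
    return [(k, count_modes(xs, k)) for k in range(1, len(xs))]

curve = diagnostic_curve([0, 1, 2, 10, 11, 12])
```

Plotting the second coordinate of `curve` against the first gives the step function discussed above; a long run of equal values corresponds to a plateau.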

3.3  Empirical Illustrations of the Diagnostic Plot

Sixteen experiments were run using data generated from various normal
distributions and mixtures thereof. The four diagnostic plots shown in
Figure A are obtained for two samples of size 50 and two of size 100 that
are generated according to the univariate unimodal standard normal
distribution, N(0,1), while those shown in Figure B are obtained for
corresponding samples generated according to a bivariate unimodal normal
distribution with mean (0,0). In all of these plots, a very extensive
plateau can be observed where the number of identified modes is 1, while
no other stable number of modes is indicated.

Figures  C  and  D  show  some  interesting,  yet  disturbing  features  of  the 
proposed  diagnostic  plot.   Although,  as  can  be  expected  of  samples  from 
bimodal  distributions,  all  of  the  plots  show  a  wide  range  of  stability 
for  bimodality,  some  of  the  plots  also  show  stable  plateaus  for  trimodality 
(see Figures C(a2), C(b2), and D(b2)). Since each of the samples used to
obtain Figures C(b1) and C(b2) consists of 30 observations from N(0,1)
and 70 observations from N(8,4), it is unreasonable to expect the number
of identified modes to be greater than 1 when k is greater than 30.
Hence, the relatively short bimodality plateau shown in Figure C(b2) is not
unexpected. However, the diagnostic plots shown in Figures C(b1) and C(b2)
also  show  that  two  different  samples  from  the  same  distributions  can  give 
plateaus  of  fairly  different  widths. 

It  is  difficult  to  account  for  the  trimodality  plateau  that  is  evident 
in Figure C(a2), but at least in this case it is significantly narrower
than  the  very  stable  bimodality  plateau.   For  Figure  D(b2),  a  look  at  the 
corresponding  scatterplot  (Figure  E)  suggests  that  the  appearance  of  a 
sizeable  trimodality  plateau  in  the  diagnostic  plot  is  not  unreasonable. 

In this section, we have shown that the proposed diagnostic plot is
useful in indicating the number of modes that are present in a population.
It is also useful in suggesting the possible existence of finer
subpopulations. It is, however, sensitive to the sample sizes from different
subpopulations, but only inasmuch as they impose upper bounds on the
width of the plateaus. On the whole, the proposed plot seems to be a
valuable diagnostic tool for assessing multimodality.


4.   A  TEST  STATISTIC  FOR  TESTING  THE  MULTIMODALITY  OF  A 
UNIVARIATE  DENSITY   f 

4.1  The  Test  Statistic 

Investigation  of  the  number  of  modes  or  maxima  in  a  density  has  been 
considered  by  several  authors,  for  example  Good  and  Gaskins  (1980)  and 
Silverman  (1981).   As  remarked  by  Silverman  (1981),  it  is  unfortunate 
that most of the proposed methods seem to depend on some arbitrary
implicit or explicit choice of the scale of the effects being studied.
The simple approach based on the kth nearest neighbor clustering procedure
described in this paper has the virtue of making this choice in an
automatic  and  natural  way. 

A  possible  test  statistic  for  hypotheses  concerning  the  number  of 
modes  in  a  univariate  density   f   can  be  obtained  by  applying  the  kth  nearest 
neighbor  clustering  procedure  to  the  sample  data  from  f.   Now,  the  value 
of  k   controls  the  amount  by  which  the  data  are  smoothed  to  obtain  the 
density  estimate  on  which  the  clustering  procedure  is  based.   Therefore, 
for  example,  if  the  data  are  strongly  bimodal,  a  large  value  of   k  will 
be  needed  to  give  a  sample  hierarchical  clustering  with  only  one  mode. 
Suppose that we wish to test the null hypothesis that the density f underlying
the data has M modes, against the alternative that f has more
than M modes. A natural test statistic is

$$k_{crit} = \inf\{k : f(\cdot, k) \text{ has at most } M \text{ modes}\},$$

where $f(\cdot, k)$ is
the density estimate obtained by the kth nearest neighbor procedure.
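
As a small illustration of this definition, the critical value can be read directly off the sequence of mode counts obtained for k = 1, 2, .... The helper below is hypothetical and assumes that those counts have already been computed by some mode-counting routine.

```python
def critical_k(mode_counts, n_modes):
    """Smallest k whose estimate has at most n_modes modes.

    mode_counts[k - 1] is the number of sample modes found with parameter k.
    Returns None if no value of k in the supplied range qualifies.
    """
    for k, modes in enumerate(mode_counts, start=1):
        if modes <= n_modes:
            return k
    return None

# Example with a made-up sequence of mode counts for k = 1, ..., 5.
counts = [2, 2, 1, 1, 1]
k_for_unimodal = critical_k(counts, n_modes=1)
k_for_bimodal = critical_k(counts, n_modes=2)
```

A large critical value for the unimodal hypothesis indicates that heavy smoothing is needed before the second mode disappears.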


4.2  Assessing the Significance Level P of a Sample Value of $k_{crit}$

Let $k_o$ be the sample value of $k_{crit}$ for testing $H_o$: f has M modes
against $H_A$: f has more than M modes. Our aim is to estimate the observed
significance level

$$P = \Pr\{k_{crit} > k_o \mid H_o \text{ is true}\}$$

so that we can reject $H_o$ when P is sufficiently small. It is shown
below how an estimate of P can be obtained by using a bootstrap procedure
(see Efron, 1979).

To obtain a conservative estimate of P, an appealing choice for the
null distribution $f_o$, from which simulated samples are to be taken, is
the density estimate obtained when $k_o$ is used as the value of the parameter
k, scaled to have variance equal to the sample variance $s^2$ of the
data. For univariate data, it is easy to simulate from $f_o$ by using the
bootstrap method. As pointed out in Efron (1979), N independent observations
from $f_o$ are given by

$$y_i = \left(1 + \frac{d_{k_o}^2(X_{I(i)})}{3 s^2}\right)^{-1/2} \left(X_{I(i)} + d_{k_o}(X_{I(i)})\, u_i[-1,1]\right), \quad i = 1, \ldots, N,$$

where the $X_{I(i)}$ are sampled uniformly, with replacement, from the data
$X_1, X_2, \ldots, X_N$; $s^2$ is the sample variance of the data; $d_{k_o}(X_{I(i)})$ is
the $k_o$th nearest neighbor distance of observation $X_{I(i)}$; and $u_i[-1,1]$
is an independent sequence of uniform random variables distributed between
-1 and +1. The value of P can then be estimated by finding the proportion
of R bootstrap samples of size N which give values of $k_{crit}$
greater than $k_o$.

The computational procedure can be summarized as follows:

Step 0:  Compute $s^2$ and find $k_o$ for the sample data.

Step 1:  For i = 1 to N,
sample with replacement from {1, 2, ..., N};
let I(i) be the ith pick, and let

$$y_i = \left(1 + \frac{d_{k_o}^2(X_{I(i)})}{3 s^2}\right)^{-1/2} \left(X_{I(i)} + d_{k_o}(X_{I(i)})\, u_i[-1,1]\right).$$

Step 2:  Apply the kth nearest neighbor clustering procedure,
with $k = k_o$, to the bootstrapped data $y_1, y_2, \ldots, y_N$.
Test if the number of sample modes SM is greater than M.

Step 3:  Repeat steps 1 and 2 R times (we will use R = 120).

Step 4:  Let the estimate of P be

$$p = \frac{\#\,\text{times that } (SM > M)}{120}.$$

Then, $H_o$ is accepted at the 5% level if the p-value p is greater
than 0.05.
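
Steps 0 to 4 can be sketched as follows for univariate data. This is an illustration under stated assumptions, not the program used for the results reported here: the function names are hypothetical, the sample variance is taken with divisor N - 1, the kth nearest neighbor excludes the point itself, and the routine that counts sample modes is passed in as the argument `count_modes` because its details are not fixed by the steps above.

```python
import math
import random
import statistics

def kth_nn_distances(xs, k):
    """d_k(x_i) for univariate data, excluding the point itself (assumption)."""
    return [sorted(abs(x - y) for j, y in enumerate(xs) if j != i)[k - 1]
            for i, x in enumerate(xs)]

def smoothed_bootstrap_sample(xs, k0, rng):
    """Step 1: draw N values y_i from the rescaled kth-NN density estimate."""
    n = len(xs)
    s2 = statistics.variance(xs)      # sample variance; must be positive
    dk = kth_nn_distances(xs, k0)
    sample = []
    for _ in range(n):
        idx = rng.randrange(n)        # I(i): uniform pick with replacement
        d = dk[idx]
        shrink = 1.0 / math.sqrt(1.0 + d * d / (3.0 * s2))
        sample.append(shrink * (xs[idx] + d * rng.uniform(-1.0, 1.0)))
    return sample

def bootstrap_p_value(xs, k0, n_modes, count_modes, R=120, seed=0):
    """Steps 2 to 4: proportion of R bootstrap samples with more than n_modes modes at k0.

    count_modes(ys, k) must return the number of sample modes found in ys
    with parameter k; it is supplied by the caller.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(R):
        ys = smoothed_bootstrap_sample(xs, k0, rng)
        if count_modes(ys, k0) > n_modes:
            exceed += 1
    return exceed / R
```

Passing the mode counter as an argument keeps the resampling logic separate from the clustering step, so the same function can be reused with whichever mode-counting rule is adopted.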

It should be borne in mind that this test is very conservative as it
uses the most extreme k that yields M-modality for the sample
$X_1, X_2, \ldots, X_N$.

We applied the above test to various univariate normal distributions
and mixtures thereof. Twenty-five samples of size 100 were taken for each
distribution studied. The results of the test for various null hypotheses
(one, two, and three modes) are given in Table 1 below; they consist of
values of $k_o$ (the value of $k_{crit}$ obtained from the sample) and the
corresponding estimates of P. Note that these results must of course be
interpreted as a hierarchical set of significance tests: if (M-1)-modality
is not rejected by the test, then there is no point in testing
for  M-modality.   So  we  should  test  successively  for  an  increasing  number 
of  modes  until  we  find  a  number  that  is  accepted.   In  the  following 
discussion,  we  will  use  a  significance  level  of  5%. 
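
The hierarchical reading of the tests amounts to a simple stopping rule, sketched below with a hypothetical helper. The p-values in the usage lines are the estimates reported in Section 5 for the chondrite data and for the two-species iris data.

```python
def select_number_of_modes(p_values, alpha=0.05):
    """Test 1, 2, 3, ... modes in turn and stop at the first accepted hypothesis.

    p_values[m - 1] is the estimated P for the null hypothesis 'f has m modes'.
    Returns None if every hypothesis in the list is rejected.
    """
    for m, p in enumerate(p_values, start=1):
        if p > alpha:
            return m
    return None

chondrite = [0.067, 0.677, 0.833]
iris_two_species = [0.000, 0.025, 0.017, 0.583]

modes_chondrite_5pct = select_number_of_modes(chondrite)               # 1
modes_chondrite_10pct = select_number_of_modes(chondrite, alpha=0.10)  # 2
modes_iris = select_number_of_modes(iris_two_species)                  # 4
```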

Table 1(a) shows that none of the 25 samples from N(0,1) leads to
a rejection of a unimodal null hypothesis. Equally encouraging are
the results for the fifty-fifty mixture of N(0,1) and N(4,1): bimodality
is rejected only once out of twenty-five samples. Moreover, the empirical
power of testing "$H_o$: the distribution is unimodal" against "$H_A$: the
distribution is bimodal" for samples from this mixture is very good: 92%.

For the trimodal mixture in Table 1(b) (25 observations from N(0,1),
25 from N(4,1) and 50 from N(8,1)), the test fails to reject unimodality
in 21 cases out of 25; and in two out of the remaining four cases, bimodality
cannot be rejected. The reason for the poor performance of the proposed
test for this mixture is thought to lie primarily in the small (25 observations)
and uneven (25/25/50) subsample sizes. Indeed, the density estimate
turns unimodal for k around 25 because of the small subsample sizes;
but for $k_o$ = 25 (small with respect to the sample size of 100), the
density estimate is still very sensitive to perturbations around the sample
points, and as perturbing the sample points is exactly what the bootstrap does, the
bootstrapped sample is most likely to be multimodal for $k_o$; hence the
test tends to accept unimodality. It should again be pointed out that the
proposed test is hierarchical in nature, and there is little point in
testing for bimodality if unimodality cannot be rejected.

For the results in Table 1(c) (25 observations from N(0,1), 50 from
N(4,1), 25 from N(8,1)), unimodality cannot be rejected for any of
the 25 samples, so indeed we have a very conservative test.
[Table  1  about  here] 

The  proposed  test  statistic  has  been  shown  to  perform  well  in  one 
dimension  for  truly  unimodal  distributions  and  for  bimodal  distributions 
with  nicely  separated  modes  of  equal  importance.   It  behaves  comparatively 
poorly when the subsample sizes are small and/or uneven. In fact, it is
more  conservative  than  expected,  and  needs  to  be  improved  if  it  is  to  be 
a  sharp  testing  tool;  especially  since  its  computational  expenditure  is 
non-negligible  (on  the  average,  about  one  hour  of  CPU-time  is  consumed  on 
a  Prime  850,  for  a  program  that  tests  for  1,2,3,  and  4  modes  using  a 
sample  of  size  100,  i.e.,  roughly  a  quarter  of  an  hour  of  CPU-time  per 
null  hypothesis  tested).   Moreover,  although  the  proposed  test  statistic 
is  also  well-defined  for  multivariate  data,  the  bootstrap  procedure  de- 
scribed above  for  estimating  the  p-value  of  a  sample  test  statistic  cannot 
be  easily  generalized  to  several  dimensions.   Hence,  much  work  has  yet 
to  be  done  to  develop  an  appropriate  generalization  of  this  testing  pro- 
cedure. 


5.   ILLUSTRATIVE  EXAMPLES 

In this section, the effectiveness of the proposed diagnostic plot
and the testing procedure is illustrated with real examples.
The real univariate data sets used are:

(1)  the chondrite data from Good and Gaskins (1980), 22 observations.

(2)  the petal lengths of Fisher's Iris data (Fisher, 1936) for two
Iris species (setosa and versicolor), 100 observations (2x50).

(3)  the petal lengths of Fisher's Iris data for three Iris species
(setosa, versicolor and virginica), 150 observations (3x50).

-  Data  Set  (1) 

We have analyzed the data, which consist of the percentages of silica
in 22 chondrite meteorites; these data have been studied previously by,
among others, Good and Gaskins (1980) and Silverman (1981).


Percentages y of Silica in 22 Chondrites

    y        y        y        y
  20.77    27.32    30.25    33.52
  22.56    27.33    31.89    33.83
  22.71    27.57    32.88    33.95
  22.99    27.81    33.23    34.82
  26.39    28.69    33.28
  27.08    29.36    33.40

The diagnostic plot (Figure F(a)) reveals only one very stable
plateau for unimodality. Small plateaus are also detected for two and
three modes.

The testing procedure developed in Section 4 yields the following
results:

  $H_o$         $k_o$    estimated $P(k_{crit} > k_o)$
  unimodal       8      0.067
  bimodal        5      0.677
  trimodal       2      0.833

Consequently, we cannot reject unimodality at the 5% level. (Note that
we can at the 10% level, in which case we accept bimodality of the
population.) We cannot accept trimodality exclusive of uni- or bimodality,
which is not surprising considering the small number of observations; they
could be sampled from any distribution. We do find a finer trimodal
substructure, indicated by the diagnostic plot, but no more than an
indication of it (see also Good and Gaskins (1980), and Silverman (1981),
whose conclusion is questionable).

-  Data Set (2) consists of the petal lengths of two iris species, setosa
and versicolor.

The  diagnostic  plot  (Figure  F(b))  reveals  a  stable  plateau  at  2  modes, 
but  also  suggests  small  three-  and  four-mode  plateaus.   It  seems  to 
indicate  a  basic  bimodal  population  with  possibly  some  finer  substructures 
(small  additional  modal  regions). 


When we tested for various numbers of modes, we obtained the following
results:

  $H_o$          $k_o$    estimated $P(k_{crit} > k_o)$
  unimodal       50      0.000
  bimodal        19      0.025
  trimodal       13      0.017
  quadrimodal     7      0.583

Hence  we  reject  the  first  three  null  hypotheses  at  the  5%  level,  and 
accept  quadrimodality.   It  is  known  that  the  petal  lengths  between  the 
two  species  are  very  different,  but  the  test  seems  to  indicate  that  the 
distribution  might  have  four  modes. 

-  Data Set (3) includes three species of iris (setosa, versicolor and
virginica).

The  diagnostic  plot  (Figure  F(c))  shows  a  stable  two-mode  plateau,  and 
also  a  relatively  stable  plateau  at  five  modes;  however,  the  plateaus  found 
for  two  and  three  modes  shown  in  Figure  F(b)  have  virtually  disappeared. 
The  test  statistic  also  yields  an  altogether  different  picture. 


  $H_o$          $k_o$    estimated $P(k_{crit} > k_o)$
  unimodal       51      0.750
  bimodal        19      0.325
  trimodal       16      0.108
  quadrimodal    14      0.008

The test does not reject unimodality. Now, it is known that the Iris
setosa species is very different from the other two, which are not distinct
from one another; so why does the test not reject unimodality? The main
culprit seems to be the fact that the two modes one expects to find are
of uneven sizes, and the test is very sensitive to uneven subsample sizes,
as seen earlier.

FIGURE A
Diagnostic Plots: Normal Distributions

(a1) and (a2): 50 observations from N(0,1)
(b1) and (b2): 100 observations from N(0,1)


FIGURE B
Diagnostic Plots: Normal Distributions


FIGURE C
Diagnostic Plots: Normal Mixtures

(a1) and (a2): 50 observations from N(0,1) and 50 from N(5,1)
(b1) and (b2): 30 observations from N(0,1) and 70 from N(8,4)


FIGURE D
Diagnostic Plots: Normal Mixtures

(a1) and (a2): 50 observations from a BVN with mean (0,0) and 50 from a BVN with mean (10,0)


FIGURE E
xy-Plots: Normal Mixtures

50 observations from a BVN with mean (0,0) and 100 from a BVN with mean (10,0)


FIGURE F
Diagnostic Plots

(a) chondrite data (1)
(b) iris data, 2 species (2)
(c) iris data, 3 species (3)